ABSTRACT
Despite  the  almost  universal  reliance  on  testing  as  the  means  of 
locating  software  errors  and  its  long  history  of  use,  few 
criteria  have  been  proposed  for  deciding  when  software  has  been 
thoroughly  tested.   As  a  basis  for  the  development  of 
pragmatically  usable  notions  of  test  data  adequacy,  an  abstract 
notion  is  proposed  and  examined,  and  approximations  to  this 
criterion  are  considered. 


1.   INTRODUCTION 

The  testing  phase  of  the  software  development  cycle  attempts 
to  expose  the  presence  of  as  many  errors  as  possible  in  a 
program,  and  ultimately  provide  the  developer  and  user  with  a 
belief  that  the  program  is  likely  to  be  correct.   The  ideal  goal 
is  to  guarantee  correctness,  but  in  all  except  very  simple  cases 
this  is  impossible  to  accomplish  through  testing  on  a  finite  set 
of data [11, 22]. It is common for commercially produced
programs  which  have  apparently  been  thoroughly  tested  to  exhibit 
incorrect  behavior  long  after  they  have  been  released  and  used. 

Testing  can  be  viewed  as  an  inference  process  in  the  course 
of  which  the  tester  attempts  to  deduce  properties  of  a  program  by 
observing  its  behavior  on  selected  inputs.   When  the  property  one 
desires  to  infer  is  correctness,  the  inputs  are  usually  selected 
to  cause  the  program  to  exhibit  all  potential  aspects  of  its 
behavior,  or  to  cover  all  facets  of  the  specification.   If  the 
selected  inputs  are  processed  correctly,  one  then  infers  that  the 
program  will  correctly  process  its  entire  input  domain. 

Two  key  problems  of  program  testing  are: 

1.  Given  a  program  and  specification,  how  to  select  data  which 
test  the  program  most  effectively. 

2.  Given  a  program,  specification,  and  test  data  which  are 
processed correctly, how to determine whether or not the testing
has  been  sufficient  to  justify  a  claim  that  the  program  has  been 
adequately  tested. 

The  purpose  of  this  paper  is  to  consider  the  latter  problem. 
Myers [16] states:


"The  completion  criteria  typically  used  in  practice 
are  both  meaningless  and  counterproductive.   The  two 
most  common  criteria  are 

1.  Stop  when  the  scheduled  time  for  testing  expires 

2.  Stop  when  all  the  test  cases  execute  without 
detecting  errors." 


We  shall  examine  other  proposals  for  criteria  for  test  data 
adequacy,  and  discuss  problems  associated  with  their  use  as 
practical  guides  to  whether  or  not  testing  is  complete.   In  the 
spirit  of  [7],  we  first  propose  an  ideal  criterion  for  adequacy 
and  discuss  its  practical  limitations.   We  then  consider 
approximations  to  this  adequacy  criterion. 

It  is  important  to  consider  what  the  relationship  should  be 
between  adequacy  and  test  data  selection  criteria.   One  might 
argue  that  every  test  data  selection  criterion  is  automatically 
an  adequacy  criterion,  for  we  could  simply  say  that  the  program 
has  been  adequately  tested  if  and  only  if  the  given  selection 
criterion  has  been  satisfied.   However,  most  currently  proposed 
adequacy  and  test  data  selection  criteria  represent  conditions 
which  are  necessary  for  a  program  to  be  completely  tested,  but 
certainly  are  not  sufficient.   Usually  they  are  not  even  "nearly 
sufficient",  as  it  tends  to  be  easy  to  construct  programs 
containing  errors  which  nonetheless  fulfill  each  of  these 
criteria.   In  spite  of  this,  it  is  not  uncommon  for  an  adequacy 
criterion  to  be  defined  in  terms  of  a  test  data  selection 
criterion. For example, Huang [13] advocates the selection of
data to traverse each branch of a flowgraph; once this, or a
reasonable approximation to it, is accomplished, the program is
considered adequately tested.


Due  to  the  crudeness  of  presently  available  adequacy 
notions,  we  believe  that  test  case  selection  should  not  be 
directed  towards  the  fulfillment  of  the  criterion  used  as  the 
adequacy  notion.   Rather,  test  data  should  be  generated  by  some 
"reasonable  method",  and  only  when  the  tester  feels  that  the 
program  has  been  tested  on  sufficient  data  should  an 
(independent)  adequacy  criterion  be  invoked.   If  the  adequacy 
criterion is fulfilled, then the program is deemed thoroughly
tested; otherwise additional test data must be generated and the
process repeated. Note that even at this stage, test data
selection should not be driven by adequacy criteria. In short,
we should test to locate errors, not to fulfill some (imperfect)
criterion.

Another  important  question  to  consider  is  the  desired 
relationship  between  the  correctness  of  the  program  being  tested, 
and  the  adequacy  of  the  test  data.   Should  the  adequate  testing 
of  a  program  guarantee  its  correctness?   We  argue  that  even 
though  in  the  ideal  case  testing  stops  only  after  all  the  errors 
have  been  located  and  removed,  in  practice  such  a  requirement  of 
correctness  is  not  realistic. 

On  the  other  hand,  it  is  clear  that  the  correctness  of  a 
program  should  not  automatically  guarantee  that  an  arbitrary  test 
set  is  adequate.   One  of  the  weaknesses  of  the  notion  of  a 
reliable  and  valid  criterion  as  defined  in  [7],  is  that  for  a 
correct  program  any  criterion  is  reliable  and  valid,  and  hence 
any  set  of  test  data  is  ideal.   It  is  crucial  that  the  properties 
which determine the adequacy of a set of test data be dependent
on the quality of the test data rather than relying solely on the
qualities  of  the  program. 

With  these  concerns  in  mind,  we  will  introduce  in  Section  3 
an  abstract  notion  of  adequacy  which  has  the  property  that  if  a 
program  has  been  adequately  tested,  then  it  is  correct,  but  the 
correctness  of  the  program  does  not  imply  that  it  has  been 
adequately  tested.   In  Section  4  we  compare  this  notion  to  other 
adequacy  criteria,  and  in  Section  5  we  consider  pragmatic 
approximations  to  this  ideal  notion  of  adequacy.   Section  6 
demonstrates  the  use  of  these  notions  on  some  examples. 


2.   EXISTING  NOTIONS  OF  TEST  DATA  ADEQUACY 

Goodenough and Gerhart [7] defined an ideal set of tests to
have  properties  which  would  imply  that  the  tests  are  capable  of 
exposing  all  errors  in  a  program.   Thus,  if  a  program  produces 
correct  results  on  a  set  of  ideal  tests,  it  must  be  correct. 
However,  these  properties  are  non-constructive  in  the  sense  that 
they  do  not  tell  us  how  to  produce  ideal  tests  for  a  given 
program.   In  addition,  it  is  generally  impossible  to  determine 
whether  a  given  set  of  tests  for  a  program  is  ideal. 

Lacking  a  guaranteed  way  to  create  tests  which  can 
conclusively  demonstrate  correctness,  software  test  developers 
need a method of determining when sufficient testing has been
done. Such an adequacy criterion for test data should
characterize  the  test's  ability  to  expose  errors  in  the  program. 
Many  of  the  adequacy  criteria  which  have  been  proposed  and  are  in 
use today approach this natural goal only indirectly. It is
common, for example, to require that every statement, branch, or
path  fulfilling  some  condition  be  traversed  in  order  that  test 
data  be  deemed  adequate  [13,  23],  even  though  it  has  been  pointed 
out  [4,  7,  22]  that  these  notions  of  adequacy  suffer  from 
deficiencies.   In  particular  it  is  easy  to  devise  simple  programs 
and  test  data  such  that,  even  though  the  program  contains  errors, 
the  requirements  of  each  of  these  criteria  are  fulfilled.   Since 
the  goal  of  testing  is  to  detect  the  presence  of  errors,  and 
these  notions  of  adequacy  measure  code  traversal,  it  is  not  too 
surprising  that  they  are  not  really  satisfactory  indicators  of 
how  thoroughly  the  program  has  been  tested.   Furthermore,  these 
criteria  are  themselves  untestable  in  general;   there  can  be  no 
algorithm  to  decide  of  an  arbitrary  program  whether  a  given 
statement, branch, or path can be traversed, nor whether every
statement,  branch,  or  path  can  be  traversed  [20]. 
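
As a hedged illustration of this deficiency, the following Python
sketch shows a faulty program whose test data traverse every branch
without exposing the error. The names spec_double and faulty_double,
and the hand-written branch bookkeeping in `taken`, are hypothetical
and are not drawn from the cited papers.

```python
def spec_double(x):
    """Intended behavior (hypothetical specification): return 2 * x."""
    return 2 * x


def faulty_double(x, taken):
    """Hypothetical faulty program; `taken` records the branches traversed."""
    if x > 0:
        taken.add("then")
        return x * x  # error: for x > 0 this agrees with 2 * x only when x == 2
    taken.add("else")
    return 2 * x


tests = [2, -3]
taken = set()

# Every test is processed correctly ...
all_pass = all(faulty_double(t, taken) == spec_double(t) for t in tests)
# ... and both branches of the single decision have been traversed ...
branches_covered = taken == {"then", "else"}
# ... yet the program is wrong on an untested input.
error_remains = faulty_double(3, set()) != spec_double(3)
```

Here the test set {2, -3} satisfies branch coverage and exposes no
error, although the program is incorrect for every positive input
other than 2.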

Several  other  criteria  for  test  data  adequacy  have  been 
proposed  and  discussed.   Error  seeding  [6,  16]  consists  of  the 
deliberate  implantation  of  bugs  in  the  program  being  tested.   The 
new  version  of  the  program  is  then  run  on  the  set  of  test  data 
which  has  been  proposed  as  adequate  to  see  how  many  of  the 
implanted  bugs  are  exposed.   It  is  then  assumed  that  k%    of  the 
original  bugs  have  been  found  provided  that  k%  of  the  implanted 
bugs  have  been  located.   This  technique  assumes  that  the  types 
and  distribution  of  bugs  which  occur  unintentionally  are  the  same 
as  those  implanted,  a  convenient,  but  usually  inaccurate, 
assumption.
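
The arithmetic behind this assumption can be sketched as follows.
The function name and the sample figures are illustrative
assumptions, and the estimate is only as good as the premise that
seeded and original bugs are equally easy to expose.

```python
def estimate_total_original_bugs(seeded_total, seeded_found, original_found):
    """Estimate the number of original bugs under the error seeding assumption.

    If a fraction k of the seeded bugs was exposed, the method assumes the
    same fraction k of the original bugs was exposed as well.
    """
    if seeded_total <= 0 or seeded_found <= 0:
        raise ValueError("need at least one seeded bug, and at least one exposed")
    detection_rate = seeded_found / seeded_total
    return original_found / detection_rate


# Illustrative figures: 20 bugs seeded, 15 of them exposed, 6 original bugs found.
estimated_total = estimate_total_original_bugs(20, 15, 6)
estimated_remaining = estimated_total - 6
```

With these figures the detection rate is 75%, so the six original
bugs found would be taken to represent 75% of all original bugs.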

Another adequacy criterion which has been proposed is known
as the program mutation method [1, 4]. This system makes a
series  of  minor  changes  to  the  program  being  tested,  creating  a 
set  of  programs  known  as  mutations.   Some  of  these  modifications 
cause  program  errors,  while  others  simply  yield  equivalent 
programs.   A  proposed  set  of  test  data  is  considered  adequate  if 
it  causes  every  inequivalent  mutation  to  give  an  incorrect  answer 
on  some  input  in  the  set.   What  the  authors  have  done  is  to 
implicitly  define  what  they  consider  to  be  the  class  of  most 
likely  simple  errors.   By  showing  that  these  errors  do  not  occur, 
they  have  not  guaranteed  the  absence  of  all  errors,  but  rather 
that  the  program  is  either  correct  or  radically  incorrect.   Since 
the  authors  assume  that  the  program  being  tested  was  written  by  a 
"competent  programmer",  i.e.   a  person  who  writes  programs  which 
are  "close"  to  being  correct,  the  second  alternative  can  be 
eliminated.   A  closely  related  system  was  implemented  by  Hamlet 
and  is  described  in  [8]. 
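
A minimal sketch of the mutation criterion, under stated
assumptions, may make the definition concrete. The program, the
three mutants, and the finite domain used to approximate
equivalence are all hypothetical; in general, deciding whether a
mutant is equivalent to the original is not possible
algorithmically, so the bounded check below is only a stand-in.

```python
from itertools import product


def program(a, b):
    """Hypothetical program under test: the larger of a and b."""
    return a if a > b else b


def mutant_ge(a, b):
    """'>' replaced by '>=': an equivalent mutant."""
    return a if a >= b else b


def mutant_lt(a, b):
    """'>' replaced by '<'."""
    return a if a < b else b


def mutant_return_a(a, b):
    """Result of the else branch replaced by a."""
    return a if a > b else a


MUTANTS = [mutant_ge, mutant_lt, mutant_return_a]
DOMAIN = list(product(range(-3, 4), repeat=2))  # bounded stand-in for all inputs


def equivalent_on_domain(mutant):
    return all(mutant(a, b) == program(a, b) for a, b in DOMAIN)


def surviving_mutants(tests):
    """Names of inequivalent mutants that no test distinguishes from the program."""
    return [
        m.__name__
        for m in MUTANTS
        if not equivalent_on_domain(m)
        and all(m(a, b) == program(a, b) for a, b in tests)
    ]


survivors_small = surviving_mutants([(2, 1)])
survivors_full = surviving_mutants([(2, 1), (1, 2)])
```

Under this sketch the single test (2, 1) leaves one inequivalent
mutant undistinguished, so it is not mutation adequate; adding
(1, 2) distinguishes every inequivalent mutant.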

All  the  proposed  criteria  for  test  data  adequacy  discussed 
above  are  program-based.   That  is,  they  rely  solely  on  the 
written  code  of  the  program  being  tested.   Since  the 
effectiveness  of  test  data  depends  on  how  well  it  characterizes 
both  the  intended  and  the  actual  computation,  it  is  clear  that  a 
proposed  measure  of  that  effectiveness  should  be  based  on  both 
the  problem  specification  and  the  written  program.   It  is  now 
being  recognized  that  program  testing  techniques  should  be  based 
on  both  the  specification  and  the  program  [7,  16,  21,  22]. 
Howden  [12],  for  example,  concludes  that  "functional  and 
structural testing should be viewed as complementary rather than
competing techniques." This realization, however, has not yet
been  applied  to  criteria  of  test  adequacy.   It  is  certainly 
important  to  develop  test  data  based  on  all  sources  of 
information,  but  even  more  important  to  use  a  notion  of  adequacy 
which  judges  the  test  data's  quality  by  considering  all  possible 
information  sources.   After  all,  test  data  derived  from  one 
source may coincidentally reflect characteristics of other
sources.   But  an  adequacy  criterion  is  being  used  to  judge  the 
test  data's  quality,  and  therefore  must  assess  that  quality 
against  all  sources. 


3.   AN  ABSTRACT  NOTION  OF  ADEQUACY 

Goodenough  and  Gerhart  [7]  used  the  concept  of  an  ideal  test 
as  the  basis  of  their  theory  of  program  testing.   The  theory 
which  they  proposed  describes  characteristics,  or  sufficient 
conditions,  for  a  set  of  tests  to  be  ideal,  but  does  not  provide 
a  means  of  determining  whether  the  conditions  are  fulfilled. 

Informally,  we  expect  a  test  set  to  be  adequate,  or  to 
thoroughly  test  a  program  if  the  tests  cover  all  aspects  of  the 
program's  computation.   Goodenough  and  Gerhart  suggest  that  a 
test  set  is  more  likely  to  be  ideal  if  it  takes  account  of  each 
of  the  following  factors: 

1.  Every  individual  branching  condition  in  the 
program  is  represented  in  the  tests; 

2.  Every  potential  termination  condition  in  the 
program  is  represented  in  the  tests; 

3.  Every  variable  mentioned  in  a  program  decision  is 
partitioned  correctly  into  classes  that  are  "treated 
the  same"  by  the  program; 

4.  Every condition relevant to the correct operation
of the program that is implied by the specification,
knowledge  of  the  program's  data  structures,  or 
knowledge  of  the  general  method  being  implemented  by 
the  program,  is  represented  in  the  tests. 


These  aspects  of  a  good  set  of  tests  may  be  summarized  by 
saying  that  the  tests  characterize  the  actual  computation 
performed  by  the  program,  and  the  computation  intended  by  the 
specification.   To  capture  precisely  this  notion  of  test  set 
adequacy, we use the concept of program inference, the derivation
of  a  program  from  a  sample  of  its  input/output  behavior.   Program 
testing  and  program  inference  can  be  thought  of  as  being  inverse 
processes.   The  testing  process  begins  with  a  program  and 
specification,  and  looks  for  input/output  pairs  that  characterize 
every  aspect  of  both  the  intended  and  actual  computations. 
Program  inference  starts  with  a  set  of  input/output  pairs  and 
specification,  and  derives  a  "simplest"  program  to  fit  this  given 
behavior.

In  order  to  infer  a  program  from  test  data  one  would  almost 
certainly  need  several  "central"  examples  to  indicate  the  general 
pattern.   In  contrast  we  might  well  consider  it  sufficient  to 
test  a  program  on  only  one  or  two  such  "central"  test  cases.   For 
both  testing  and  inference,  however,  boundary  points  have  to  be 
explicitly  described.   For  program  inference,  it  is  clearly 
necessary  to  identify  where  each  type  of  computation  begins  and 
ends.   In  the  case  of  testing,  we  know  that  these  boundary 
points,  and  points  near  them,  are  particularly  error-prone,  and 
thus  must  be  included  in  the  test  set. 

We let I_T denote the program inferred from the set of input
data T, and for programs P and Q we write P ≡ Q (P is equivalent to
Q) to mean that P(x) = Q(x) for every input x (and hence P(x) is
defined if and only if Q(x) is defined). Formally we have:

A set of input data T is an inference adequate test set for
program P intended to fulfill specification S if and only if:

1.  I_T ≡ P, and

2.  I_T ≡ S.

That  is,  T  is  adequate  if  and  only  if  the  program  is  correct 
and  T  contains  sufficient  data  to  infer  both  P  and  S.   As 
indicated  above,  this  may  require  that  the  program  be  tested  on 
more  instances  of  "central"  test  data  than  might  otherwise  be 
included,  but  certainly  the  small  number  of  additional  tests  is 
not  a  significant  burden.   Our  notion  requires  that  special 
attention  be  paid  to  subdomain  boundaries.   For  example,  if  the 
specification  indicates  that  the  program  is  to  do  something  for 
all  integer  input  values  less  than  0  and  something  different  for 
values  greater  than  or  equal  to  0,  then  in  addition  to  testing 
the  program  on  a  positive  input  and  a  negative  input,  we  would 
have  to  include  values  around  the  boundary  (in  particular  -1  and 
0)  if  we  hope  to  infer  the  correct  program.   Since  experience 
indicates  that  such  boundaries  are  particularly  error  prone,  we 
consider  this  an  important  asset  of  our  definition. 
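
The boundary example can be sketched in a few lines. The
specification, the faulty program with its off-by-one comparison,
and the chosen test values are hypothetical illustrations of the
situation described above.

```python
def spec(x):
    """Hypothetical specification: one behavior below 0, another from 0 upward."""
    return "negative" if x < 0 else "non-negative"


def program(x):
    """Hypothetical implementation with a boundary error ('<=' instead of '<')."""
    return "negative" if x <= 0 else "non-negative"


def exposed_errors(tests):
    """Inputs in the test set on which the program disagrees with the specification."""
    return [t for t in tests if program(t) != spec(t)]


central_only = exposed_errors([-5, 7])
with_boundary = exposed_errors([-5, -1, 0, 7])
```

One negative and one positive "central" value expose nothing,
whereas adding the boundary values -1 and 0 exposes the error at 0.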

Note  that  we  only  check  the  adequacy  of  a  test  set  after  it 
fails  to  expose  errors.   This  is  consistent  with  our  view  that 
the  role  of  testing  is  to  expose  errors.   As  long  as  there  is  a  t 
in T such that P(t) ≠ S(t), there is no question that the test
data is doing its job. It is only once P(t) = S(t) for every t
in T that we have to determine whether or not it is time to stop
testing.   Ideally  the  process  ends  when  the  program  is  correct 
and  the  test  data  is  sufficient  to  determine  this. 

If  a  set  of  test  data  is  to  be  inference  adequate  as  defined 
above  then  the  test  data  must  truly  test  each  portion  of  the 
program  code  as  well  as  the  specification.   Furthermore,  the  fact 
that T is adequate means that I_T is equivalent to both P and S
and  hence  P  is  correct.   We  thus  have  a  notion  of  test  adequacy 
which  implies  program  correctness.   Just  as  in  the  case  of 
Goodenough  and  Gerhart's  ideal  tests,  we  should  not  expect  to  be 
able  to  fulfill  this  condition  easily  or  to  be  able  to  verify  in 
general  whether  it  holds. 

An  immediate  objection  to  inference  adequacy  might  be  that 
if  we  could  really  show  that  a  program  inferred  from  data  were 
equivalent  to  the  specification,  we  would  not  have  to  write  a 
program  to  begin  with.   Why  not  simply  rely  on  such  "automatic 
programming"  systems?   Although  this  is  a  possible  methodology 
for  program  production  in  the  future,  it  does  not  appear 
realistic  today.   Existing  systems  infer  programs  in  very  high 
level languages such as Prolog [14], Pure Lisp [15], or SETL [5].
Programs  in  such  languages  are  frequently  very  inefficient, 
interpreted  rather  than  compiled,  and  hence  impractical  for  use 
as  production  programs. 

Several  inference  systems  have  been  implemented  ([2],  [17], 
[18],  [19])  and  the  precise  algorithm  used  to  infer  the  program 
is  not  central  to  our  discussion.   It  is  not  unusual  for  a  system 
to  be  able  to  infer  a  program  from  any  set  of  data  which  is  not 
inconsistent. In such a system the only program inferable might
be  a  trivial  program  which  is  defined  only  for  the  given 
input/output  pairs.   In  that  case  it  would  not  be  unreasonable 
for  the  system  to  request  more  information  in  the  form  of 
additional  examples.   We  would  then  consider  the  test  data  as 
inadequate  for  the  program  P.   This  is  consistent  with  our 
definition, since I_T would be undefined and thus not equivalent
to  P. 

Much  of  the  work  on  program  inference  is  based  on  work  on 
the  theory  of  inductive  inference.   In  particular,  Blum  and  Blum 
[3] describe the design of powerful inductive inference machines
which, given input/output pairs (x,y), try to find an algorithm
that computes an extension of a function f such that f(x) = y for
every such pair. Such a machine functions by
periodically producing an output which is its current hypothesis
for the index of the program which computes f. When a new (x,y)
pair is input, a check is made to see whether f(x) = y for the
currently hypothesized f, and if not, a new hypothesis is
generated  which  satisfies  all  the  pairs  received  so  far.   It  is 
hoped  that  the  inference  machine  will  eventually  converge.   Blum 
and  Blum  showed  that  a  very  wide  class  of  functions  can  be 
identified by such inductive inference machines. Thus when we
speak  of  the  program  inferred  from  a  set  of  input  data,  the 
reader  may  take  this  to  be  with  respect  to  some  suitable 
inductive  inference  machine.   In  particular,  we  will  frequently 
refer  to  Summers'  program  inference  system  [19]  which  can  be 
thought of as an inductive inference machine in the theoretical
sense.
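
The hypothesize-and-revise behavior described above can be sketched
as identification by enumeration over a small hypothesis space.
This is only an illustration: the class name, the hypothesis space
(the functions x mod k for k = 1, ..., 8), and the sample data are
assumptions, and the sketch is neither Blum and Blum's construction
nor Summers' system.

```python
class EnumerationInferenceMachine:
    """Keeps the index of the first enumerated hypothesis fitting all pairs seen."""

    def __init__(self, hypotheses):
        self.hypotheses = hypotheses
        self.pairs = []
        self.index = 0  # current hypothesis: an index into the enumeration

    def feed(self, x, y):
        """Accept a new (x, y) pair and return the current hypothesis index."""
        self.pairs.append((x, y))
        if self.index is not None and self.hypotheses[self.index](x) == y:
            return self.index  # current hypothesis still fits; keep it
        # Otherwise move to the first hypothesis consistent with every pair so far.
        self.index = next(
            (
                i
                for i, h in enumerate(self.hypotheses)
                if all(h(a) == b for a, b in self.pairs)
            ),
            None,  # no hypothesis in the enumeration fits the data
        )
        return self.index


# Hypothetical enumeration: index i stands for the function x mod (i + 1).
hypotheses = [lambda x, k=k: x % k for k in range(1, 9)]
machine = EnumerationInferenceMachine(hypotheses)

# Pairs drawn from f(x) = x mod 4, presented one at a time.
data = [(0, 0), (1, 1), (4, 0), (5, 1), (2, 2), (3, 3)]
guesses = [machine.feed(x, y) for x, y in data]
```

In this run the machine settles on index 3 (x mod 4) only after the
pair (2, 2) contradicts its earlier hypothesis x mod 2, and the
hypothesis does not change thereafter.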

To demonstrate the necessity of including both requirements
1 and 2 in the definition of an inference adequate test set,
consider:

S(x) = x mod 4

P(x) = x mod 2

T = {(0,0), (1,1), (4,0), (5,1)}

I_T(x) = x mod 2

Then P(t) = S(t) for every t in T, and P ≡ I_T, but S ≢ I_T.
The removal of requirement 2 would therefore make T an adequate
test set even though the program is incorrect and the test data
exposes no error. This is true because the only portions of P's
input domain which have been tested are those for which P
produces the correct output.

If instead we have:

S(x) = x mod 2

P(x) = x mod 4

T = {(0,0), (1,1), (4,0), (5,1)}

I_T(x) = x mod 2

then P ≢ I_T, and hence P has not been adequately tested even
though S ≡ I_T. This indicates that there are portions of the
actual computation which have not been checked. Intuitively,
this is reasonable as we would expect an adequate test set for a
mod 4 program to at least include numbers which are congruent to
0, 1, 2, and 3 mod 4.
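
The two examples can be checked mechanically with a toy inference
procedure. In the sketch below, infer returns the smallest modulus
k (up to an assumed bound) for which x mod k fits every pair, and
equivalence is approximated on a finite domain; both choices are
simplifying assumptions made for illustration, not part of the
definition.

```python
DOMAIN = range(0, 48)  # finite stand-in for the input domain (assumption)


def infer(pairs, max_modulus=8):
    """Toy inference: the 'simplest' x mod k (smallest k) fitting every pair."""
    for k in range(1, max_modulus + 1):
        if all(x % k == y for x, y in pairs):
            return lambda x, k=k: x % k
    return None  # no hypothesis fits, so the inferred program is undefined


def equivalent(f, g):
    return all(f(x) == g(x) for x in DOMAIN)


def check(program, spec, pairs):
    """Return (inferred equivalent to P, inferred equivalent to S)."""
    inferred = infer(pairs)
    if inferred is None:
        return (False, False)
    return (equivalent(inferred, program), equivalent(inferred, spec))


def mod2(x):
    return x % 2


def mod4(x):
    return x % 4


T = [(0, 0), (1, 1), (4, 0), (5, 1)]

first = check(program=mod2, spec=mod4, pairs=T)   # requirement 1 only
second = check(program=mod4, spec=mod2, pairs=T)  # requirement 2 only
third = check(program=mod4, spec=mod4, pairs=T + [(2, 2), (3, 3)])
```

T is inference adequate only when both components are true, which
in this sketch happens for a correct mod 4 program once inputs
congruent to 2 and 3 mod 4 are included.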

There  are  three  distinct  types  of  difficulties  that  need  to 
be  considered  in  connection  with  a  proposed  notion  of  test  data 
adequacy. The first type involves unsolvability problems. Our
proposal  requires  the  ability  to  infer  a  program  from  data  and  to 
determine  whether  or  not  it  is  equivalent  to  the  original 
program,  even  though  program  equivalence  is  not  in  general  a 
recursively  solvable  problem.   Note,  however,  that  in  the  case  of 
each  of  the  other  previously  proposed  notions  a  similar 
unsolvable  problem  must  be  faced. 

The  second  type  of  problem  concerns  usability:   Is  it 
reasonable  to  require  the  fulfillment  of  the  criterion?   In  the 
case  of  the  code  traversal  measures,  for  example,  there  are 
frequently  too  many  paths  to  be  able  to  traverse  all  of  them. 
With  the  mutation  method,  an  n  line  program  produces  on  the  order 
of n^2 mutants which must be distinguished from, or shown to be
equivalent  to,  the  original.   For  a  moderate  sized  program  this 
can  require  that  far  too  many  programs  be  run.   Our  proposed 
criterion  for  adequacy  encounters  a  different  type  of 
intractability.   In  particular,  the  state  of  the  art  of  program 
inference  systems  is  currently  not  well  developed,  and  there  is 
serious  doubt  that  such  systems  will  ever  be  practical  tools  for 
program  creation. 

The  third  point  to  examine  is  whether  what  the  notion 
characterizes  is  really  appropriate.   We  have  discussed  this  for 
other  proposed  criteria  and  claim  that  the  present  notion 
reflects  what  is  meant  intuitively  by  test  data  adequacy,  since 
the  goal  is  to  completely  characterize  by  test  data,  all  portions 
of  both  the  intended  and  actual  computations. 

Our  notion  of  adequacy  has  two  other  interesting 
characteristics. The first is what might be called monotonicity.
That is, if T is inference adequate for a program P relative to
specification S and T ⊆ T', then T' is also inference adequate for
P and S. Clearly, adding test data to an adequate test set
should not make the set inadequate. All of the adequacy criteria
mentioned above are monotonic, and it could reasonably be argued
that in order for an adequacy notion to be appropriate, it must
be monotonic.

The other property, called extensionality, is: if T is
adequate for P and P ≡ P', then T is adequate for P'.
Extensionality is certainly not enjoyed by most of the criteria
discussed. Given that P ≡ P', if T is inference adequate for P
relative to S then obviously T is also inference adequate for P'
and S. In contrast, consider the programs P and P' of Figure 1.
Although P ≡ P', and T = {(0,FALSE), (1,TRUE)} is adequate for P'
using the branch coverage criterion, it is not adequate for P
using this criterion (related questions are discussed in the next
section).
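
Figure 1 is not reproduced here, so the following sketch uses two
hypothetical stand-in programs, p and p_prime, chosen only to
exhibit the same phenomenon: they are equivalent, yet the same
inputs are branch adequate for one and not for the other. Branches
are labelled and recorded by hand; the labels and the bounded
equivalence check are assumptions of the sketch.

```python
def p_prime(x, taken):
    """Hypothetical P': a single decision with two branches."""
    if x == 0:
        taken.add("b1")
        return False
    taken.add("b2")
    return True


def p(x, taken):
    """Hypothetical P: equivalent to P', but with a non-traversable branch."""
    if x == 0:
        taken.add("b1")
        return False
    taken.add("b2")
    if x == 0:  # can never hold here, so branch b3 is non-traversable
        taken.add("b3")
        return False
    taken.add("b4")
    return True


def branch_adequate(program, all_branches, inputs):
    """True when the inputs traverse every labelled branch of the program."""
    taken = set()
    for x in inputs:
        program(x, taken)
    return taken == all_branches


inputs = [0, 1]  # the inputs of T = {(0,FALSE), (1,TRUE)}
agree = all(p(x, set()) == p_prime(x, set()) for x in range(-5, 6))
adequate_for_p_prime = branch_adequate(p_prime, {"b1", "b2"}, inputs)
adequate_for_p = branch_adequate(p, {"b1", "b2", "b3", "b4"}, inputs)
```

Since b3 can never be traversed, no test set is branch adequate for
this p, while the two inputs above already suffice for p_prime.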

Note that the extensionality of an adequacy criterion
implies  that  the  adequacy  of  T  is  independent  of  the  actual 
implementation  P.   Thus  both  the  simplest  program  which 
implements  a  given  specification,  and  a  very  complicated 
implementation  will  require  the  same  test  data  for  adequacy. 
Since  our  notion  involves  a  complete  characterization  of  what  is 
computed, we feel that extensionality is both a reasonable and a
desirable  property. 

Most  adequacy  criteria  are  certainly  not  extensional  simply 
because they are completely program (code) dependent. Even for
the  mutation  method,  the  adequacy  of  a  set  of  test  data  for  a 
program  P  does  not  imply  that  it  is  adequate  for  all  programs 
equivalent  to  P.   As  the  program  changes,  so  does  the  set  of 
mutations  which  must  be  distinguished  from  the  given  program,  and 
hence  the  test  data  to  differentiate  them. 

In  the  next  section  we  consider  other  relationships  between 
various  adequacy  criteria.   In  particular  we  are  interested  in 
studying  the  relative  strength  of  these  notions. 


4.   RELATIONS  AMONG  ADEQUACY  CRITERIA 

One  would  like  to  be  able  to  formally  compare  the  inference 
criterion  with  the  other  adequacy  criteria  discussed  earlier.   It 
has  been  shown  [20]  that  branch  adequacy  implies  statement 
adequacy.   We  would  have  liked  to  be  able  to  show  that  inference 
adequacy  implies  branch  adequacy.   Unfortunately,  however,  this 
is  not  true.   Let  P  be  a  program  containing  a  branch  which  is 
non-traversable (i.e., for which there is no input value such
that  the  branch  is  traversed).   Program  P  of  Figure  1  is  an 
example.   Then  no  set  of  test  data  can  be  branch  adequate  for  P. 
Nonetheless,  it  is  certainly  possible  to  infer  from  some  set  of 
test  data  T  a  program  P'  which  is  equivalent  to  both  S  and  P. 
Then  T  is  inference  adequate  for  P  but  not  branch  adequate  for  P. 
A  similar  argument  can  be  made  for  programs  P  which  have 
"inessential"  branches.   An  inessential  branch  is  one  which  is 
traversable,  but  preceded  by  a  decision  which  is  not  necessary. 
As  a  simple  example  consider  the  flowchart  fragment  of  Figure  2. 
Precisely because the decision is not necessary for the
computation,  data  can  be  selected  which  do  not  traverse  both  the 
T  and  F  branches  yet  which  sufficiently  characterize  the 
computation  to  enable  the  inference  of  a  different,  but 
equivalent  program.   This  data  would  be  inference  adequate,  but 
not  branch  adequate. 

But  it  is  not  only  these  "anomalous"  cases  that  prevent  the 
implication  from  being  true.   If,  for  example,  an  implementation 
contains  a  loop  which  has  been  "unwound"  more  times  than  the 
amount  of  data  needed  to  infer  that  portion  of  the  program,  it 
too could be inference adequate without being branch adequate.
If rather than requiring the inference of a program equivalent to
P, we required the program P itself be inferred, this problem
would be solved. But that would eliminate from consideration
programs  containing  unreachable  code,  or  inessential  branches 
even  though  we  know  that  people  do  include  (presumably 
unintentionally)  such  code  in  their  programs.   A  more  important 
problem  is  that  since  the  goal  of  most  implemented  inference 
systems  is  to  infer  the  simplest  program  consistent  with  the 
data,  such  a  system  would  only  be  usable  to  test  the  adequacy  of 
"optimal"  programs,  and  we  know  that  most  programs  do  not 
represent  the  simplest  possible  implementation  of  a 
specification.   Clearly,  without  a  precise  definition  of 
inference,  it  is  not  possible  to  formally  prove  the  desired 
theorems.   Although  the  intuition  of  our  inference  adequacy 
notion  is  independent  of  the  particular  program  inference  system 
used, in order to demonstrate the use of our ideas with examples,
it will certainly be necessary to employ some particular
inference  system.   In  Section  6,  we  present  four  examples  of  the 
use  of  inference  adequacy,  using  Summers'  [19]  inference  system. 
We  therefore  use  this  system  in  order  to  make  precise  the 
comparison  between  inference  adequacy  and  branch  adequacy. 

Summers  states:   "What  the  system  is  to  do  is  to  produce  the 
simplest  program  which  satisfies  the  examples."  Although  he  does 
not  formally  define  the  term  "simplest",  it  is  clear  that  at  the 
very  least  the  inferred  program  should  contain  neither 
non-traversable nor inessential branches, and in fact it is clear
that  this  is  guaranteed  for  the  programs  produced  by  his  system. 
In  addition,  if  some  statements  of  a  program  P  are  deleted  to 
yield  a  new  program  P',  then  P'  is  simpler  than  P.   Thus  we  have 
the  following  theorem  which  relates  a  restricted  form  of 
inference  adequacy  and  branch  adequacy. 

Theorem: If T is inference adequate for program P relative to
specification  S,  and  the  inferred  program  is  P,  then  T  is  branch 
adequate  for  P. 

Proof:   Since  P  is  the  inferred  program,  it  is  a  "simplest" 
program  which  fulfills  S  and  hence  every  decision  is  necessary. 
Thus  failure  to  include  data  to  traverse  some  branch  will  lead  to 
the  inference  of  a  "simpler"  program  P'.   Since  P  is  a  "simplest" 
program  to  fulfill  S,  P'  cannot  be  equivalent  to  P.   Hence  the 
test  data  cannot  be  inference  adequate.   □ 


A similar argument proves the following closely related
result:

Corollary: Let P be a "simplest" program to fulfill
specification  S  and  let  T  be  a  set  of  inference  adequate  test 
data  for  P  relative  to  S.   Then  T  is  branch  adequate  for  P. 

It  is  easier  to  compare  inference  adequacy  to  Goodenough  and 
Gerhart's  notion  of  an  ideal  test.   The  following  two  theorems 
show  that  inference  adequacy  is  a  strictly  stronger  notion  than 
idealness.

Theorem: Let P be a program intended to fulfill specification S.
If  test  set  T  is  inference  adequate  for  P  relative  to  S,  then  T 
is  an  ideal  test  set  for  P. 

Proof: Since T is inference adequate for P, it follows that P is
correct  and  hence  any  test  set  is  ideal  for  P.   □ 

Theorem: Let P be a program intended to fulfill specification S.
There  exists  a  test  set  T  which  is  ideal  for  P  which  is  not 
inference  adequate  for  P  relative  to  S. 

Proof:   If  P  is  not  correct,  then  no  test  set  is  inference 
adequate  for  P  relative  to  S.   Hence  let  P  be  a  correct  program. 
Then any test set, including the empty set, is ideal for P. But
clearly  the  empty  set,  and  many  non-empty  sets,  are  not  inference 
adequate  for  the  given  program.   □ 


It  is  interesting  to  compare  the  philosophy  underlying 
inference adequacy and that of the program mutation method which
we  discussed  in  Section  2.   A  primary  difference  is  that  using 
our  definition  of  adequacy,  a  test  set  is  always  considered  to  be 
adequate  or  inadequate  for  a  given  program  relative  to  a  given 
specification.   In  contrast,  as  indicated  previously,  the 
mutation  method  is  a  program-based  strategy.   Still,  the  basic 
philosophy  is  similar.   Our  notion  requires  that  sufficient  test 
data  be  generated  in  order  to  distinguish  both  the  intended  and 
actual computations from all nonequivalent programs. The
mutation method, in contrast, requires that the test data be
sufficient to distinguish the program from only some
nonequivalent programs, namely the programs which the authors
have  deemed  most  likely  to  occur  if  the  original  program  is  not 
correct.   In  that  sense,  mutation  testing  may  be  thought  of  as  an 
approximation  to  our  ideal  adequacy  notion. 
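
The contrast can be sketched in a few lines of Python. The program `first_half`, the three hand-written mutants, and the helper `kills_all` are hypothetical illustrations and are not taken from the mutation system discussed in Section 2. The sketch only shows that a test set may distinguish a program from a chosen finite set of alternatives without distinguishing it from every nonequivalent program.

```python
def first_half(xs):
    """Program under test: first half of an even-length list, None otherwise."""
    n = len(xs)
    return xs[: n // 2] if n % 2 == 0 else None


# Hypothetical mutants: small syntactic changes to first_half.
def mutant_off_by_one(xs):
    n = len(xs)
    return xs[: n // 2 + 1] if n % 2 == 0 else None


def mutant_wrong_parity(xs):
    n = len(xs)
    return xs[: n // 2] if n % 2 == 1 else None


def mutant_second_half(xs):
    n = len(xs)
    return xs[n // 2:] if n % 2 == 0 else None


MUTANTS = [mutant_off_by_one, mutant_wrong_parity, mutant_second_half]


def kills_all(program, mutants, tests):
    """True if every mutant differs from program on at least one test input."""
    return all(any(m(t) != program(t) for t in tests) for m in mutants)


def unseen_alternative(xs):
    """A nonequivalent program outside the mutant set: differs on odd lengths only."""
    n = len(xs)
    return xs[: n // 2] if n % 2 == 0 else xs[: (n + 1) // 2]


tests = [list("AB"), list("ABCD")]
mutation_adequate = kills_all(first_half, MUTANTS, tests)
distinguishes_unseen = any(unseen_alternative(t) != first_half(t) for t in tests)
```

Here the two even-length tests separate `first_half` from all three mutants, yet they cannot separate it from `unseen_alternative`, which a criterion quantifying over all nonequivalent programs would have to rule out.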

In  the  next  section  we  consider  other  ways  of  approximating 
inference  adequacy  more  directly,  while  addressing  the  practical 
difficulties.


5.   APPROXIMATIONS  TO  THE  INFERENCE  CRITERION 

The  preceding  discussion  of  the  general  difficulties  of 
using  inference  adequacy  as  a  pragmatic  criterion,  coupled  with 
the  knowledge  that  it  is  even  stronger  than  criteria  which  we 
have previously argued are not pragmatically usable, makes it
clear  that  our  notion  can  at  best  be  used  as  a  guide.   The  next 
task  is  therefore  to  consider  practically  attainable 
approximations  to  this  notion.   There  are  several  ways  of 
proceeding.

The  first  possibility  is  to  place  sufficient  restrictions  on 
the  programs  to  be  considered  so  that  questions  which  are 
unsolvable or intractable in general become solvable for programs in
the  restricted  class.   For  example,  inference  is  feasible  in  many 
cases  for  programs  whose  behavior  may  be  modeled  by  a 
finite-state  machine.   Inference  for  such  machines  is 
accomplished by performing checking experiments. In addition,
equivalence  is  decidable  for  finite  state  machines,  and  hence  for 
such  programs.   Nevertheless,  serious  practical  limitations  and 
difficulties  are  associated  with  such  experiments,  and  there  is 
an extensive discussion of these problems in Hennie's book [10].
He points out, for example, that for a machine with seven states,
there are 10^[exponent illegible] possible state tables. Hamlet [9]
has discussed these limitations vis-a-vis testing. We concur with Hamlet's
assessment  that  this  direction  is  not  likely  to  be  productive. 
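
The combinatorial growth behind this limitation can be made concrete with a short calculation. The Python sketch below counts the state tables of a Mealy-style machine, assuming that each table entry independently chooses a next state and an output symbol. The binary input and output alphabets are assumptions made purely for illustration and are not taken from Hennie's text, so the resulting figure need not match his.

```python
def state_table_count(n_states, n_inputs, n_outputs):
    """Number of distinct state tables: each of the n_states * n_inputs
    entries picks one of n_states next states and one of n_outputs outputs."""
    choices_per_entry = n_states * n_outputs
    entries = n_states * n_inputs
    return choices_per_entry ** entries


# Assumed alphabets (not from the source): binary input, binary output.
count = state_table_count(7, 2, 2)
print(f"{count:.2e} possible state tables for seven states")
```

Even under these modest assumptions the count has seventeen decimal digits, which illustrates why exhaustive checking experiments are impractical.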

A  second  way  to  proceed  would  be  to  look  directly  for 
practical  approximations  to  program  inference  and  equivalence, 
and  consider  the  relaxation  of  some  of  the  requirements.   One 
might,  for  example,  remove  the  requirement  that  the  inferred 
program  must  be  equivalent  to  both  the  specification  and  the 
program  being  tested.   Such  a  relaxation  would  eliminate  the 
guarantee  that  an  adequately  tested  program  is  correct.   If 
I_T ≡ P, we say that T is program-adequate, and if I_T ≡ S, T is
specification-adequate.

The  decision  as  to  which  of  these  two  requirements  to  relax 
might  depend  heavily  on  the  type  of  test  data  selection  criterion 
used. In general, if a program-based selection criterion was
used then we would be more willing to eliminate the requirement
that I_T must be shown to be equivalent to P. Similarly, if a
specification-based selection criterion was used, then the I_T ≡ S
requirement  might  reasonably  be  eased.   In  either  case  we  are 
left  with  determining  at  least  one  equivalence.   Pragmatic  use  of 
the  definition  of  inference  adequacy  is  therefore  dependent  upon 
finding  a  way  around  the  fact  that  determining  equivalence  of 
arbitrary  programs  and  specifications  is  undecidable.   The 
testing  process  itself  is  an  attempt  to  overcome  this  general 
undecidability.

The  assessment  of  specification  adequacy  can  be  made  easier 
by producing I_T in a very high level language such as SETL [5],
Pure Lisp [15], or Prolog [14]. Although the general equivalence
problem  is  still  undecidable,  a  major  virtue  of  writing  programs 
in  such  languages  is  that  the  programs  look  very  much  like  the 
specifications  and  hence  it  is  easier  to  determine  equivalence. 

In  the  case  of  determining  program-adequacy,  we  can 
approximate  checking  for  equivalence  by  the  following  technique 
which is essentially an extension of testing to I_T, and has the
benefit  of  suggesting  additional  tests  for  the  original  program 
if  T  is  not  adequate.   It  may  also  indicate  the  type  of  error 
present  if  the  program  is  incorrect. 

Suppose  we  have  specification  S,  program  P,  and  test  set  T 
such that P(t) = S(t) for every t in T. I_T is the program
inferred  from  T.   To  judge  whether  T  is  an  adequate  test  set,  we 
generate  an  additional  set  R  of  tests  by  some  means,  possibly  by 
random selection. We require only that the tests in R be
independent  of  those  in  T.   The  next  step  is  to  test  P  on  R. 

It  is  worthwhile  considering  the  implications  of  the 
possible outcomes of the additional tests R. If P(r) ≠ S(r) for
some r in R, then P is not correct and testing must continue.
Assuming P(r) = S(r) for every r in R, we now test I_T on R. If
I_T(r) = P(r) for every r in R, then T is accepted as an adequate
test set for P and S. This may be thought of as approximating
the equivalence of I_T and P. If I_T(r) ≠ P(r) for some r in R,
then I_T is incorrect on some elements of R, since P(r) = S(r) for
all r in R. This indicates that we have not tested P
sufficiently, since I_T ≢ P. Thus we must continue testing. In
particular, new test data should be similar to the elements of R
where I_T was incorrect, since that part of the problem's domain
was not characterized sufficiently well by the original tests. A
new program I_T' must then be inferred from the augmented test set
T'  and  the  process  repeated. 
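
The case analysis above can be summarized as a small decision procedure. The Python sketch below is an illustration under stated assumptions, not an implementation from this paper: `spec`, `program`, and `inferred` are hypothetical callables standing for S, P, and I_T, and `approximate_adequacy` is a hypothetical name. It returns the verdict together with the elements of R that justify it.

```python
def approximate_adequacy(spec, program, inferred, extra_tests):
    """Classify the outcome of running P and I_T on the extra test set R.

    spec, program and inferred are callables standing for S, P and I_T;
    extra_tests is R. Returns a (verdict, witnesses) pair.
    """
    # Step 1: P must agree with S on every element of R.
    wrong = [r for r in extra_tests if program(r) != spec(r)]
    if wrong:
        return "program incorrect", wrong
    # Step 2: I_T must agree with P on every element of R.
    differing = [r for r in extra_tests if inferred(r) != program(r)]
    if differing:
        return "test set inadequate", differing
    return "test set accepted", []


# Toy usage: S and P compute absolute value; I_T is assumed to have been
# inferred from non-negative tests only, so it behaves like the identity.
verdict, witnesses = approximate_adequacy(abs, abs, lambda x: x, [-3, 5, 4])
```

The witnesses returned in the inadequate case are exactly the elements of R near which further test data should be chosen before a new program is inferred.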

An  interesting  and  important  question  to  consider  is  what  is 
a  reasonable  size  for  R.   Obviously  if  R  contains  only  one  piece 
of  test  data  we  feel  far  less  assured  than  if  R  contains  many 
pieces  of  test  data.   As  a  rule  of  thumb,  we  suggest  the 
requirement that |R| ≥ |T|.


6.   EXAMPLES 

In  this  section  we  demonstrate  the  application  of  our 
adequacy  notion  using  an  example  drawn  from  Summers  [19]  with 
some  simple  variations.   Summers'  system  was  selected  because  it 
is a real, implemented inference system which infers programs in
basic  LISP.   The  programs  being  tested  are  written  in  PL/I.   If 
the  specification  indicates  that  a  program  is  to  be  undefined  (or 
produce  no  output)  for  a  given  input  we  use  the  notation  "-"  in 
the  output  portion  of  an  input/output  pair.   This  situation 
occurs in Example 2.

As  discussed  earlier,  in  each  case  S(t)  =  P(t)  for  all  t  in 
the  test  set  T.   In  order  to  approximate  the  determination  of  the 
equivalence of I_T to S and P, we generate a set of random input
test data R. The values {7, 1, 6, 8}, to be used as lengths of
input  strings,  were  generated  by  using  the  SETL  [5]  random  number 
generator,  requesting  four  integers  between  1  and  15. 
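
A comparable set of random inputs can be produced as follows. This Python sketch uses the standard `random` module as a stand-in for the SETL generator, so it will not reproduce the values above. The function name `random_inputs`, its `seed` parameter, and the choice of the first n letters of the alphabet as list elements are assumptions made for illustration.

```python
import random
import string


def random_inputs(count, low=1, high=15, seed=None):
    """Return `count` lists whose lengths are drawn uniformly from low..high.

    Each list holds the first n uppercase letters, e.g. ['A', 'B', 'C'].
    """
    rng = random.Random(seed)
    lengths = [rng.randint(low, high) for _ in range(count)]
    return [list(string.ascii_uppercase[:n]) for n in lengths]


# Four inputs, matching the size of the test set T used in the examples.
R = random_inputs(4, seed=0)
```

Fixing the seed makes a run repeatable, which is convenient when the same set R is to be reused across several examples.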


Example 1

(This example uses Summers' specification, test data, and
inferred program. We provide the PL/I program P1.)

S:   The  program  accepts  as  input  a  list  X  of  arbitrary  length  n, 
and  returns  the  first  half  of  X  if  n  is  even.   If  n  is  odd,  the 
program's  output  is  undefined. 

P1:  half: procedure options (main);
       dcl instring char(80) varying, outstring char(10) varying;
       dcl (inlength, halflength) fixed;
       put list ('type $ to terminate program');
       get list (instring);
       inlength = length(instring);
       do while (inlength ¬= '$');
         put list (instring);  put skip;
         halflength = (inlength + 1)/2;
         /* equality holds only when inlength is even */
         if inlength = (2 * halflength)
           then do;
             outstring = substr(instring, 1, halflength);
             put list (outstring);  put skip;
           end;
         get list (instring);
         inlength = length(instring);
       end;
     end half;

T:   {((),()), ((AB),(A)), ((ABCD),(AB)), ((ABCDEF),(ABC))}

I_T: half[x] <- h[x;x]

     h[x;y] <- [atom[y] -> nil;
                T -> cons[car[x]; h[cdr[x]; cddr[y]]]]

S(r) = P1(r) = I_T(r) for all r in R. We thus consider T
adequate to test P1 relative to S. Since output is undefined for
odd n, it is intuitively reasonable to have only even length
tests. In fact, we do not care what output is produced for odd
length inputs.
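
The comparison in this example can be replayed with a Python transliteration. This is a sketch under stated assumptions rather than the original PL/I or LISP: lists of letters stand for the input lists, `None` stands for an undefined result, and taking `cdr` of an empty list is treated as undefined, which is how the sketch reproduces the reported behavior of the inferred program on odd-length inputs. All function names are hypothetical.

```python
UNDEFINED = None  # stands for the "-" entries used in the examples


def spec(xs):
    """S: first half of an even-length list; undefined for odd length."""
    n = len(xs)
    return xs[: n // 2] if n % 2 == 0 else UNDEFINED


def p1(xs):
    """Mirrors the logic of P1: output only when the length is even."""
    half = (len(xs) + 1) // 2
    return xs[:half] if len(xs) == 2 * half else UNDEFINED


def cdr(xs):
    if not xs:
        raise ValueError("cdr of an atom is undefined")
    return xs[1:]


def h(x, y):
    if not y:                                # atom[y] -> nil
        return []
    return [x[0]] + h(cdr(x), cdr(cdr(y)))   # T -> cons[car[x]; h[cdr[x]; cddr[y]]]


def inferred(xs):
    """I_T: half[x] <- h[x;x], with failures mapped to UNDEFINED."""
    try:
        return h(xs, xs)
    except ValueError:
        return UNDEFINED


# R: lists of lengths 7, 1, 6 and 8, as in the text.
R = [list("ABCDEFGHIJKLMNO"[:n]) for n in (7, 1, 6, 8)]
all_agree = all(spec(r) == p1(r) == inferred(r) for r in R)
```

With these definitions the three functions agree on every element of R, which corresponds to accepting T in this example.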

Example 2

S:   As  in  Example  1 

P2:  half: procedure options (main);
       dcl instring char(80) varying, outstring char(40) varying;
       dcl halfl fixed;
       put list ('type $ to terminate program');
       get list (instring);
       halfl = (length(instring) + 1) / 2;
       do while (halfl ¬= '$');
         put list (instring) skip;  put skip;
         /* no parity check: a prefix is printed for odd lengths too */
         outstring = substr(instring, 1, halfl);
         put list (outstring) skip;  put skip;
         get list (instring);
         halfl = (length(instring) + 1) / 2;
       end;
     end half;

T:   As  in  Example  1 

I_T: As in Example 1


In this case the program being tested, P2, and the inferred
program I_T do not match on two of the inputs of R. In particular
P2((ABCDEFG)) = (ABCD) and P2((A)) = (A), while I_T((ABCDEFG)) and
I_T((A)) are undefined. This indicates that the test data has not
sufficiently characterized the actual computation and thus is not
adequate. Since P2((ABCDEFG)) ≠ S((ABCDEFG)) and P2((A)) ≠
S((A)), we see that it is the original program P2 which is
incorrect. The tester should now observe that the program is
incorrect for strings of odd length and therefore be pointed to
the appropriate place to modify the program. We next correct the
program and assume it is now P1 of Example 1. We shall call this
modified program P'. P' is now tested on T' = T ∪ R = {((),()),
((A),-), ((AB),(A)), ((ABCD),(AB)), ((ABCDEF),(ABC)),
((ABCDEFG),-), ((ABCDEFGH),(ABCD))}. Since P'(t) = S(t) for all
t in T', we now check I_T(t) for each t in T'. Since in this
example I_T(t) = S(t) for every t in T', no new program need be
inferred.

We must now generate a new set of random test data R'.
Since |T'| = 7, we generate seven random numbers between 1 and 15
to use as our test set to approximate equivalence. The set of
numbers generated was {15, 9, 11, 5, 6, 4, 13} and P'(r) = S(r) =
I_T(r) for every r in R'. The corrected program P' is thus
considered  adequately  tested  by  T'. 
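
The first half of this example, in which R exposes the error in P2, can be replayed in Python. As before this is only a sketch: `None` models an undefined result, `p2` keeps just the prefix computation of the PL/I program, and `inferred` is an iterative rendering of I_T that is assumed to be undefined whenever the second argument runs out after an odd number of elements. All names are hypothetical.

```python
def spec(xs):
    """S of Example 1; None models "undefined"."""
    return xs[: len(xs) // 2] if len(xs) % 2 == 0 else None


def p2(xs):
    """Mirrors P2: always outputs the first (n + 1) / 2 elements."""
    return xs[: (len(xs) + 1) // 2]


def inferred(xs):
    """I_T of Example 1: consume y two elements at a time, x one at a time."""
    out, y = [], xs
    while y:
        if len(y) < 2:          # cddr[y] would take cdr of an atom
            return None
        out.append(xs[len(out)])
        y = y[2:]
    return out


# R: lists of lengths 7, 1, 6 and 8.
R = [list("ABCDEFG"), list("A"), list("ABCDEF"), list("ABCDEFGH")]
program_errors = [r for r in R if p2(r) != spec(r)]
inference_gaps = [r for r in R if inferred(r) != p2(r)]
```

Both lists contain exactly the two odd-length inputs, so the same elements of R show that P2 disagrees with S and that I_T disagrees with P2.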

Example 3

S:   The  program  accepts  as  input  a  list  X  of  arbitrary  length  n, 
and  returns  a  list  consisting  of  the  first  half  of  X  if  n  is 
even.   If  n  is  odd,  the  program  outputs  the  first  (n+1)/2 
elements  of  X. 

P3:  P1  of  Example  1 

T:   As  in  Example  1 

I_T: As in Example 1


In this case P3((ABCDEFG)) and P3((A)) are undefined while
S((ABCDEFG)) = (ABCD) and S((A)) = (A). Thus, some of the
randomly generated data indicate that the program P3 is incorrect
and hence certainly not adequately tested. Note that in
addition, I_T((ABCDEFG)) and I_T((A)) are undefined. This
indicates  that  the  test  data  did  not  sufficiently  characterize 
the  intended  computation.   As  in  Example  2,  the  program  does  not 
agree  with  the  specification  on  odd  length  lists.   The  program  is 
then  corrected  to  P',  where  P'  is  program  P2  of  Example  2. 

This  new  program  must  now  be  tested.   The  test  set  for  P' 
must  certainly  include  the  original  test  data  T  as  well  as  the 
randomly  generated  test  data  R  which  exposed  the  error  in  P3.   We 
might  want  to  include  additional  test  data  as  well,  particularly 
additional  odd  length  lists  and  thus  we  select  the  following  test 
set  which  contains  all  lists  of  length  up  to  8. 

T':  {((),()), ((A),(A)), ((AB),(A)), ((ABC),(AB)),
     ((ABCD),(AB)), ((ABCDE),(ABC)), ((ABCDEF),(ABC)),
     ((ABCDEFG),(ABCD)), ((ABCDEFGH),(ABCD))}

In this case, P'(t) = S(t) for all t in T' as required.

Summers' system would then infer the program I_T':

     half[x] <- h[x;x]

     h[x;y] <- [atom[y] -> nil;
                atom[cdr[y]] -> cons[car[x]; nil];
                T -> cons[car[x]; h[cdr[x]; cddr[y]]]]


Again additional random test data R' is generated with
|R'| = 9. P'(r) = S(r) = I_T'(r) for each r in R'. The corrected
program P' is thus considered adequately tested by the test data
T'.
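
The effect of the extra clause in the newly inferred program can be checked with a Python transliteration. This is a sketch with hypothetical names: Python lists stand for LISP lists, and an empty list plays the role of an atom in the `atom` tests. The check below runs over every length from 0 to 15 rather than over a random sample, which is feasible only because the example is so small.

```python
def spec3(xs):
    """S of Example 3: the first half, rounded up for odd lengths."""
    return xs[: (len(xs) + 1) // 2]


def h(x, y):
    if not y:                          # atom[y] -> nil
        return []
    if not y[1:]:                      # atom[cdr[y]] -> cons[car[x]; nil]
        return [x[0]]
    return [x[0]] + h(x[1:], y[2:])    # T -> cons[car[x]; h[cdr[x]; cddr[y]]]


def inferred_prime(xs):
    """The program inferred from the augmented test set: half[x] <- h[x;x]."""
    return h(xs, xs)


letters = "ABCDEFGHIJKLMNO"
agree = all(
    inferred_prime(list(letters[:n])) == spec3(list(letters[:n]))
    for n in range(16)
)
```

The middle clause is what makes the program defined on odd-length lists: it stops the recursion one step early instead of applying `cddr` to a one-element list.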

Example 4

S:   As in Example 3
P4:  P2 of Example 2
T:   As in Example 1
I_T: As in Example 1

Using  the  same  set  of  random  data  R  as  in  the  previous 
examples, we see that S(r) = P4(r) for all r in R, but
I_T((ABCDEFG)) and I_T((A)) are undefined. In addition,
S((ABCDEFG)) = P4((ABCDEFG)) = (ABCD) and S((A)) = P4((A)) = (A).
This indicates that T is not sufficient to adequately test P4
relative to S. Unlike the situation in Examples 2 and 3, the
additional test data of R does not indicate an error in P4. We
note that whereas all the input lists in T were of even length,
two of the lists of R were odd, and I_T produced the incorrect
output  for  these  inputs.   This  indicates  to  us  that  our  test  set 
T  must  be  augmented  by  some  lists  of  odd  length.   Our  new  test 
set T' is:

T':  {((),()), ((A),(A)), ((AB),(A)), ((ABC),(AB)),
     ((ABCD),(AB)), ((ABCDE),(ABC)), ((ABCDEF),(ABC))}

The inferred program I_T' is then the same as I_T' of Example
3. Again, we must generate random test data. Using the test set
R' of Example 2, we have that P4(r) = S(r) for each r in R'. We
next run I_T' on the input values of R' and this time find that
I_T' produces the intended output for each of these inputs. Thus
T' is deemed adequate to test P4 relative to S.


7.   CONCLUSIONS

We  have  introduced  an  idealized  notion  of  test  data  adequacy 
which  requires  that  the  test  data  be  sufficient  to  infer  both  the 
intended  and  actual  computations.   We  pointed  out  pragmatic 
limitations  of  this  notion  and  considered  plausible 
approximations  to  the  ideal  requirements.   In  particular,  we 
considered  ways  of  approximating  the  determination  of  equivalence 
of programs.

Although  there  exist  several  implemented  inference  systems, 
and  we  in  fact  demonstrated  our  adequacy  criterion  using  one,  it 
is  not  clear  that  such  systems  will  ever  be  practically 
available.   Therefore,  just  as  we  used  testing  with  random  data 
as  an  approximation  to  the  (unsolvable)  problem  of  determining 
equivalence,  it  would  be  both  reasonable  and  interesting  to 
attempt  to  develop  practical  approximations  to  program  inference. 